7 research outputs found

    Automatic Machine Learning by Pipeline Synthesis using Model-Based Reinforcement Learning and a Grammar

    Automatic machine learning is an important problem at the forefront of machine learning. The strongest AutoML systems are based on neural networks, evolutionary algorithms, and Bayesian optimization. Recently, AlphaD3M reached state-of-the-art results with an order-of-magnitude speedup using reinforcement learning with self-play. In this work we extend AlphaD3M by using a pipeline grammar and a pre-trained model that generalizes from many different datasets and similar tasks. Our results demonstrate improved performance compared with our earlier work and existing methods on AutoML benchmark datasets for classification and regression tasks. In the spirit of reproducible research, we make our data, models, and code publicly available. Comment: ICML Workshop on Automated Machine Learning.
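To illustrate what a pipeline grammar contributes, here is a minimal sketch of how a grammar restricts the search space to well-formed pipelines. The grammar, the primitive names, and the `expand` helper are all hypothetical; they are not taken from AlphaD3M, and exhaustive enumeration merely stands in for the learned search described in the abstract.

```python
from itertools import product
from typing import Dict, List

# Hypothetical toy grammar: non-terminals map to lists of productions.
# Any symbol that is not a key is treated as a terminal (a primitive name).
GRAMMAR: Dict[str, List[List[str]]] = {
    "PIPELINE": [["IMPUTER", "SCALER", "ESTIMATOR"], ["IMPUTER", "ESTIMATOR"]],
    "IMPUTER": [["mean_imputer"], ["median_imputer"]],
    "SCALER": [["standard_scaler"]],
    "ESTIMATOR": [["random_forest"], ["logistic_regression"]],
}


def expand(symbol: str) -> List[List[str]]:
    """Enumerate every sequence of primitives derivable from `symbol`."""
    if symbol not in GRAMMAR:
        return [[symbol]]  # terminal: a single primitive
    pipelines: List[List[str]] = []
    for production in GRAMMAR[symbol]:
        # Expand each symbol of the production, then combine the choices.
        parts = [expand(s) for s in production]
        for combo in product(*parts):
            pipelines.append([tok for part in combo for tok in part])
    return pipelines


all_pipelines = expand("PIPELINE")
```

Every element of `all_pipelines` is valid by construction, so a search procedure that only follows grammar productions never spends evaluations on malformed pipelines.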

    BugDoc: Iterative debugging and explanation of pipelines

    Peer reviewed. Applications in domains ranging from large-scale simulations in astrophysics and biology to enterprise analytics rely on computational pipelines. A pipeline consists of modules and their associated parameters, data inputs, and outputs, which are orchestrated to produce a set of results. If some modules derive unexpected outputs, the pipeline can crash or lead to incorrect results. Debugging these pipelines is difficult since there are many potential sources of errors, including bugs in the code, input data, software updates, and improper parameter settings. We present BugDoc, a system that automatically infers the root causes and derives succinct explanations of failures for black-box pipelines. BugDoc does so by using provenance from previous runs of a given pipeline to derive hypotheses for the errors, and then iteratively runs new pipeline configurations to test these hypotheses. Besides identifying issues associated with computational modules in a pipeline, we also propose two methods: “opportunistic group testing” to identify portions of data inputs that might be responsible for failed executions (what we call), helping users narrow down the cause of failure; and “selective instrumentation” to determine nodes in pipelines that should be instrumented to improve efficiency and reduce the number of iterations to test. Through a case study of deployed workflows at a software company and an experimental evaluation using synthetic pipelines, we assess the effectiveness of BugDoc and show that it requires fewer iterations to derive root causes and/or achieves higher-quality results than previous approaches.
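The idea of testing groups of data inputs against a black-box pipeline can be sketched with plain recursive halving. This is generic group testing, not BugDoc's opportunistic variant; the function name and the `pipeline_ok` callback are hypothetical, and the sketch assumes a group fails exactly when it contains at least one bad input.

```python
from typing import Callable, List, Sequence


def find_failing_rows(
    rows: Sequence[int], pipeline_ok: Callable[[Sequence[int]], bool]
) -> List[int]:
    """Return the rows that make the black-box pipeline fail.

    Assumes a group of rows fails if and only if it contains a bad row.
    """
    if not rows or pipeline_ok(rows):
        return []  # the whole group runs cleanly, so discard it in one test
    if len(rows) == 1:
        return list(rows)  # a failing group of one row is a culprit
    mid = len(rows) // 2
    # The group fails: test each half separately and combine the culprits.
    return find_failing_rows(rows[:mid], pipeline_ok) + find_failing_rows(
        rows[mid:], pipeline_ok
    )


# Toy black box: the pipeline fails whenever row 3 or row 11 is present.
BAD_ROWS = {3, 11}


def toy_pipeline_ok(group: Sequence[int]) -> bool:
    return not any(row in BAD_ROWS for row in group)


culprits = find_failing_rows(list(range(16)), toy_pipeline_ok)
```

When bad inputs are rare, large clean groups are eliminated with a single run, which is what makes group testing cheaper than running the pipeline on each input separately.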

    DataPrism: Exposing Disconnect between Data and Systems

    Peer reviewed. As data is a central component of many modern systems, the cause of a system malfunction may reside in the data and, specifically, in particular properties of the data. For example, a health-monitoring system that is designed under the assumption that weight is reported in lbs will malfunction when encountering weight reported in kilograms. Like software debugging, which aims to find bugs in the source code or runtime conditions, our goal is to debug data to identify potential sources of disconnect between the assumptions about some data and the systems that operate on that data. We propose DataPrism, a framework to identify data properties (profiles) that are the root causes of performance degradation or failure of a data-driven system. Such identification is necessary to repair data and resolve the disconnect between data and systems. Our technique is based on causal reasoning through interventions: when a system malfunctions for a dataset, DataPrism alters the data profiles and observes changes in the system's behavior due to the alteration. Unlike statistical observational analysis that reports mere correlations, DataPrism reports causally verified root causes of the system malfunction, expressed in terms of data profiles. We empirically evaluate DataPrism on seven real-world and several synthetic data-driven systems that fail on certain datasets for a diverse set of reasons. In all cases, DataPrism identifies the root causes precisely while requiring orders of magnitude fewer interventions than prior techniques.
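The intervention idea can be sketched with the abstract's own lbs/kilograms example. Everything named here (`causal_profiles`, the toy `system_ok` check, and the two interventions) is hypothetical and far simpler than DataPrism itself: it alters one data profile at a time and keeps those whose alteration makes the malfunction disappear.

```python
from typing import Callable, Dict, List

Dataset = List[Dict[str, float]]


def causal_profiles(
    data: Dataset,
    system_ok: Callable[[Dataset], bool],
    interventions: Dict[str, Callable[[Dataset], Dataset]],
) -> List[str]:
    """Return names of profiles whose alteration removes the malfunction."""
    if system_ok(data):
        return []  # nothing to debug
    return [name for name, alter in interventions.items() if system_ok(alter(data))]


def system_ok(data: Dataset) -> bool:
    # Toy system that silently assumes weights in lbs: it malfunctions when
    # the mean adult weight looks implausibly low.
    mean_weight = sum(row["weight"] for row in data) / len(data)
    return mean_weight >= 100.0


def kg_to_lbs(data: Dataset) -> Dataset:
    # Intervention on the unit profile of the weight column.
    return [{**row, "weight": row["weight"] * 2.20462} for row in data]


def clip_age(data: Dataset) -> Dataset:
    # Intervention on the range profile of the age column.
    return [{**row, "age": min(row["age"], 90.0)} for row in data]


failing_data: Dataset = [
    {"weight": 60.0, "age": 34.0},
    {"weight": 75.0, "age": 51.0},
    {"weight": 90.0, "age": 120.0},
]
root_causes = causal_profiles(
    failing_data, system_ok, {"weight_unit": kg_to_lbs, "age_range": clip_age}
)
```

Clipping the outlying age changes nothing about the system's behavior, while converting the weights restores it, so only `weight_unit` is reported. A purely observational analysis could have flagged the age outlier as well.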

    AlphaD3M: An Open-Source AutoML Library for Multiple ML Tasks

    Peer reviewed. We present AlphaD3M, an open-source Python library that supports a wide range of machine learning tasks over different data types. We discuss the challenges involved in supporting multiple tasks and how AlphaD3M addresses them by combining deep reinforcement learning and meta-learning to construct pipelines effectively over a large collection of primitives. To better integrate the use of AutoML within the data science lifecycle, we have built an ecosystem of tools around AlphaD3M that support user-in-the-loop tasks, including selecting suitable pipelines and developing custom solutions for complex problems. We present use cases that demonstrate some of these features. We report the results of a detailed experimental evaluation showing that AlphaD3M is effective and derives high-quality pipelines for a diverse set of problems, with performance comparable or superior to state-of-the-art AutoML systems.

    AlphaD3M: Machine Learning Pipeline Synthesis

    Peer reviewed. We introduce AlphaD3M, an automatic machine learning (AutoML) system based on meta reinforcement learning using sequence models with self-play. AlphaD3M is based on edit operations performed over machine learning pipeline primitives, which provides explainability. We compare AlphaD3M with state-of-the-art AutoML systems (Autosklearn, Autostacker, and TPOT) on OpenML datasets. AlphaD3M achieves competitive performance while being an order of magnitude faster, reducing computation time from hours to minutes, and is explainable by design.

    Escalated Antipredator Mechanisms Of Two Neotropical Marsupial Treefrogs

    The sequence and intensity of antipredator mechanisms may be displayed according to the risk of predation. We tested this hypothesis using two species of marsupial treefrogs from Brazil's Atlantic Forest. We observed Gastrotheca recava and G. megacephala displaying nine antipredator mechanisms and three types of defensive calls. These behaviours were displayed in an escalated sequence from motionless (passive behaviour) to biting (the most aggressive behaviour). This diversified set of antipredator mechanisms may be related to the interaction between predator and prey at the local scale. The escalated sequence of defensive behaviours should be considered in future studies on anuran-predator interaction.
    Volume 26, issue 3, pages 237-244.
    Funding: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) [140710/2013-2, 405285/2013-2, 302589/2013-9, 483412/2010-4]; Ecology Center at Utah State University; CAPES/FAPES; Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES); Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) [2014/233887]; Rufford Foundation; Herpetologist's League.

    Neotropical freshwater fishes: A dataset of occurrence and abundance of freshwater fishes in the Neotropics

    The Neotropical region hosts 4225 freshwater fish species, ranking first among the world's most diverse regions for freshwater fishes. Our NEOTROPICAL FRESHWATER FISHES data set is the first to produce a large-scale Neotropical freshwater fish inventory, covering the entire Neotropical region from Mexico and the Caribbean in the north to the southern limits in Argentina, Paraguay, Chile, and Uruguay. We compiled 185,787 distribution records, with unique georeferenced coordinates, for the 4225 species, represented by occurrence and abundance data. The numbers of species in the most species-rich orders are as follows: Characiformes (1289), Siluriformes (1384), Cichliformes (354), Cyprinodontiformes (245), and Gymnotiformes (135). The most recorded species was the characid Astyanax fasciatus (4696 records). We registered 116,802 distribution records for native species, compared to 1802 distribution records for nonnative species. The main aim of the NEOTROPICAL FRESHWATER FISHES data set was to make these occurrence and abundance data accessible to international researchers for ecological and macroecological studies, from local to regional scales, with focal fish species, families, or orders. We anticipate that the NEOTROPICAL FRESHWATER FISHES data set will be valuable for studies on a wide range of ecological processes, such as trophic cascades, fishery pressure, the effects of habitat loss and fragmentation, and the impacts of species invasion and climate change. There are no copyright restrictions on the data; please cite this data paper when using the data in publications.